Chapter 3 Exploratory Data Analysis
3.1 Start with dplyr counts and summaries in console
David Robinson first explores new data with simple counts in the console.
Here we skip the package name prefix (breaking the rule I just gave you) so that we can type dplyr verbs quickly and explore the data.
df %>% count(city) %>% View()
df %>% count(city, year, month) %>% View()
df %>% group_by(city) %>% summarise(vol_max = max(volume, na.rm = TRUE)) %>% arrange(desc(vol_max)) %>% View()
3.2 Plot data points with geom_point()
After using count(), group_by() and summarise(), plot all the data points with ggplot2::geom_point(). It almost NEVER fails to show you what’s going on and is unlikely to return errors.
This is the minimum and most reliable ggplot code to start with. Let’s look at all the values of sales for each date.
## Warning: Removed 568 rows containing missing values (geom_point).
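The code for this first plot isn’t shown above, so here is a minimal sketch of what it might look like. It assumes ggplot2’s built-in txhousing data as a stand-in for df (there date is a decimal year rather than a real date), and the name p_points is made up for illustration.

```r
library(magrittr) # provides the %>% pipe

# Assumption: ggplot2's built-in txhousing data stands in for the chapter's df.
df <- ggplot2::txhousing

# Every sales value plotted against its date, nothing else.
p_points <- df %>%
  ggplot2::ggplot() +
  ggplot2::aes(x = date, y = sales) +
  ggplot2::geom_point()

# Draw the plot when working in the console.
if (interactive()) print(p_points)
```

Rows with a missing sales value are dropped with a warning rather than an error, which is part of why this is a safe first plot.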

Then look at sales across the values of any other dimension. Here there is one other dimension: city.
## Warning: Removed 568 rows containing missing values (geom_point).
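Again the code isn’t shown, so this is only a sketch of one way to do it: swap date for city on the x axis. It assumes txhousing as a stand-in for df, and p_city is a made-up name.

```r
library(magrittr) # provides the %>% pipe

# Assumption: ggplot2's built-in txhousing data stands in for the chapter's df.
df <- ggplot2::txhousing

# Same minimal plot, but with the other dimension (city) on the x axis.
p_city <- df %>%
  ggplot2::ggplot() +
  ggplot2::aes(x = city, y = sales) +
  ggplot2::geom_point()

# Draw the plot when working in the console.
if (interactive()) print(p_city)
```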

But those points look a bit crowded. Whenever the dots overlap, replace geom_point() with geom_jitter().
We can also make the dots lighter using a non-intuitive parameter called alpha, which sets their transparency.
## Warning: Removed 568 rows containing missing values (geom_point).
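A sketch of the jittered, lighter version might look like the code below. The alpha and width values are guesses to tweak, txhousing stands in for df, and p_jitter is a made-up name.

```r
library(magrittr) # provides the %>% pipe

# Assumption: ggplot2's built-in txhousing data stands in for the chapter's df.
df <- ggplot2::txhousing

p_jitter <- df %>%
  ggplot2::ggplot() +
  ggplot2::aes(x = city, y = sales) +
  ggplot2::geom_jitter(
    alpha = 0.2,  # 0 is invisible, 1 is solid
    width = 0.3,  # spread the dots sideways within each city
    height = 0    # no vertical noise, so sales values stay exact
  )

# Draw the plot when working in the console.
if (interactive()) print(p_jitter)
```

Setting height = 0 is a deliberate choice here: the jitter only separates the dots sideways and never moves a sales value up or down.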

Of course we know sales of most things vary by season. Let’s put date on the x axis, map city to colour and, because the data runs over time, join the dots using ggplot2::geom_line().
We’re also using the reduced data set (df_red) so it’s not too crowded for now.
df_red %>%
  ggplot2::ggplot() +
  ggplot2::aes(
    x = date,
    y = sales,
    colour = city
  ) +
  ggplot2::geom_line()
## Warning: Removed 1 rows containing missing values (geom_path).

Beautiful. While sales volumes differ a lot between cities, we can see they closely follow the same seasonal pattern. But they are on very different scales, so the patterns are hard to compare. One option, which Wickham uses, is to log transform the sales value.
df_red %>%
  ggplot2::ggplot() +
  ggplot2::aes(
    x = date,
    y = base::log(sales),
    colour = city
  ) +
  ggplot2::geom_line()
## Warning: Removed 1 rows containing missing values (geom_path).
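As an alternative (not the approach used above), you can leave sales untouched and put the y axis on a log scale with ggplot2::scale_y_log10(). The sketch below assumes txhousing filtered to four cities as a stand-in for df_red. The city list and the names df_small and p_log are made up for illustration.

```r
library(magrittr) # provides the %>% pipe

# Assumption: a hypothetical reduced data set, not the chapter's df_red.
df_small <- ggplot2::txhousing %>%
  dplyr::filter(city %in% c("Abilene", "Austin", "Dallas", "Houston"))

p_log <- df_small %>%
  ggplot2::ggplot() +
  ggplot2::aes(x = date, y = sales, colour = city) +
  ggplot2::geom_line() +
  # Log-scale the axis instead of the data, so tick labels stay in sales units.
  ggplot2::scale_y_log10(labels = scales::comma)

# Draw the plot when working in the console.
if (interactive()) print(p_log)
```

This uses base 10 where base::log() uses the natural log, but the shape of the lines is the same. Only the axis labelling differs.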

3.3 Facet by categories
Another logical step after showing categories by colour is to use “small multiples”. This is a fancy way of saying draw a chart for each category and look at them all at once in a grid. An important setting here is scales = "free", which gives each panel its own axis scales so we can study what’s going on in each city.
This lets us more easily spot interesting differences in the seasonal pattern between cities.
df_red %>%
  ggplot2::ggplot() +
  ggplot2::aes(
    x = date,
    y = sales,
    colour = city
  ) +
  ggplot2::geom_line() +
  ggplot2::facet_wrap(~city,
    scales = "free"
  )
## Warning: Removed 1 rows containing missing values (geom_path).

3.4 Facet interactively (trelliscopejs)
An interactive way to facet (or create small multiples) for data exploration is trelliscopejs. Here we look at all the Texas cities in the data, faceted by city, in a trelliscope web page. Have a play on this below and see what it does.
3.5 Or loop to plot every category separately
Or, to really study each chart, nest the data into a data frame with one row per city, where each row holds that city’s data as its own data frame. Then loop through the rows, creating a plot for each city and storing it in a new column of the nested data frame.
df_red_nest_plot <-
  df_red_nest %>%
  dplyr::mutate(plot = purrr::map2(
    .x = data,
    .y = city,
    ~ ggplot2::ggplot(
      data = .x,
      ggplot2::aes(
        x = date,
        y = sales
      )
    ) +
      ggplot2::ggtitle(glue::glue("Plot of {.y}")) +
      ggplot2::geom_line()
  ))
[Output: 14 line plots, [[1]] to [[14]], one per city. Plot [[5]] gives the warning below.]
## Warning: Removed 1 rows containing missing values (geom_path).
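The nesting step and the printing step aren’t shown above, so here is a sketch of the whole loop under some assumptions: txhousing filtered to four cities stands in for df_red, and the names df_small, df_small_nest and df_small_nest_plot are made up for illustration.

```r
library(magrittr) # provides the %>% pipe

# Assumption: a hypothetical reduced data set, not the chapter's df_red.
df_small <- ggplot2::txhousing %>%
  dplyr::filter(city %in% c("Abilene", "Austin", "Dallas", "Houston"))

# One row per city; `data` is a list-column holding that city's rows.
df_small_nest <- df_small %>%
  tidyr::nest(data = -city)

# Build one plot per row and store it in a new list-column called `plot`.
df_small_nest_plot <- df_small_nest %>%
  dplyr::mutate(plot = purrr::map2(
    .x = data,
    .y = city,
    ~ ggplot2::ggplot(data = .x, ggplot2::aes(x = date, y = sales)) +
      ggplot2::ggtitle(base::paste("Plot of", .y)) +
      ggplot2::geom_line()
  ))

# Step through the plots one at a time when working in the console.
if (interactive()) purrr::walk(df_small_nest_plot$plot, print)
```

Keeping the plots in a column next to the city name means you can pull out a single chart later, for example df_small_nest_plot$plot[[1]].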

3.6 Polish your final plot
We now have a bare minimum Exploratory Data Analysis toolkit: explore the data from the console using View(), then look at the data points, then draw some line plots.
We could soon be ready to decide on the plot that tells an interesting story. But adding all the bells and whistles to make it ready for a customer or a publication can take ages. It shouldn’t be part of your exploratory data analysis.
Also, we should use the code style recommended earlier, which lays out your code cleanly. It’s then far quicker to comment out or tweak the values of each part of your plot until it looks just right.
I won’t explain each line below, other than to say you can run it in chunks to understand it, like the popular ggplot flip-books.
# a list of dates to add vertical lines to the plot
years <- base::seq.Date(
  from = as.Date("2000-01-01"),
  to = as.Date("2015-01-01"),
  by = "years"
)
df %>%
  ggplot2::ggplot() +
  ggplot2::aes(
    x = date,
    y = sales,
    colour = city
  ) +
  ggplot2::geom_line(size = 1) + # ggplot2 3.4.0 and later prefer linewidth = 1
  ggplot2::theme_minimal() +
  gghighlight::gghighlight(base::max(sales) > 5000, # highlight only cities with higher sales
    label_params = list(size = 4)
  ) +
  ggplot2::scale_y_continuous(labels = scales::comma) +
  ggplot2::scale_x_date(
    date_breaks = "1 year",
    labels = scales::date_format("%b %Y"),
    limits = c(
      as.Date("2000-01-01"),
      as.Date("2015-07-01")
    )
  ) +
  ggplot2::labs(
    title = "Texas Housing Sales",
    subtitle = "Texas cities with more than 5,000 sales in any month",
    caption = "Source: ggplot2 built in txhousing data set",
    x = "Month",
    y = "Volume of Sales"
  ) +
  ggplot2::geom_vline(
    xintercept = years,
    linetype = 4
  ) +
  ggplot2::theme(
    panel.grid.major.x = ggplot2::element_blank(),
    panel.grid.minor.x = ggplot2::element_blank(),
    strip.text.x = ggplot2::element_text(size = 10),
    axis.text.x = ggplot2::element_text(
      angle = 60,
      hjust = 1,
      size = 9
    ),
    legend.text = ggplot2::element_text(size = 12),
    legend.position = "right",
    legend.direction = "vertical",
    plot.title = ggplot2::element_text(
      size = 22,
      face = "bold"
    ),
    plot.subtitle = ggplot2::element_text(
      color = "grey",
      size = 18
    ),
    plot.caption = ggplot2::element_text(
      hjust = 0,
      size = 12,
      color = "darkgrey"
    ),
    legend.title = ggplot2::element_blank()
  )
## label_key: city
## Warning: Removed 430 rows containing missing values (geom_path).

So this isn’t necessarily a good plot. There are things wrong with it that I expect you’ll want to change. But with this clear ladder of code you can more quickly read, edit, comment chunks out, or run in chunks from the top down.